(57) ABSTRACT

A method is provided for generating a high dimensional 
density model within an acoustic model for one of a speech 
and a speaker recognition system. Acoustic data obtained 
from at least one speaker is transformed into high dimen- 
sional feature vectors. The density model is formed to model 
the feature vectors by a mixture of compound Gaussians
with a linear transform, wherein each compound Gaussian is
associated with a compound Gaussian prior and models each
coordinate of each component of the density model
independently by a univariate Gaussian mixture comprising a
univariate Gaussian prior, variance, and mean. An iterative 
expectation maximization (EM) method is applied to the 
feature vectors. The EM method includes the step of com- 
puting an auxiliary function Q of the EM method. 
HIGH DIMENSIONAL ACOUSTIC MODELING VIA MIXTURES OF
COMPOUND GAUSSIANS WITH LINEAR TRANSFORMS

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a non-provisional application claiming the benefit
of provisional application Ser. No. 60/180,306, filed on Feb.
4, 2000, the disclosure of which is incorporated by reference
herein.

This application is related to the application entitled
"High Dimensional Data Mining and Visualization via
Gaussianization", which is commonly assigned and concurrently
filed herewith and the disclosure of which is incorporated
herein by reference.

BACKGROUND

1. Technical Field

The present invention generally relates to high dimensional
data and, in particular, to methods for mining and
visualizing high dimensional data through Gaussianization.

2. Background Description

Density Estimation in high dimensions is very challenging
due to the so-called "curse of dimensionality". That is,
in high dimensions, data samples are often sparsely distributed.
Thus, density estimation requires very large neighborhoods
to achieve sufficient counts. However, such large
neighborhoods could cause neighborhood-based techniques,
such as kernel methods and nearest neighbor methods, to be
highly biased.

The exploratory projection pursuit density estimation
algorithm (hereinafter also referred to as the "exploratory
projection pursuit") attempts to overcome the curse of
dimensionality by constructing high dimensional densities
via a sequence of univariate density estimates. At each
iteration one finds the most non-Gaussian projection of the
current data, and transforms that direction to univariate
Gaussian. The exploratory projection pursuit is described by
J. H. Friedman, in "Exploratory Projection Pursuit", J.
American Statistical Association, Vol. 82, No. 397, pp.
249-66, 1987.

Recently, independent component analysis has attracted a
considerable amount of attention. Independent component
analysis attempts to recover the unobserved independent
sources from linearly mixed observations. This seemingly
difficult problem can be solved by an information maximization
approach that utilizes only the independence assumption
on the sources. Independent component analysis can be
applied for source recovery in digital communication and in
the "cocktail party" problem. A review of the current status
of independent component analysis is described by Bell et
al., in "A Unifying Information-Theoretic Framework for
Independent Component Analysis", International Journal on
Mathematics and Computer Modeling, 1999. Independent
component analysis has been posed as a parametric probabilistic
model, and a maximum likelihood EM algorithm has
been derived, by H. Attias, in "Independent Factor
Analysis", Neural Computation, Vol. 11, pp. 803-51, May
1999.

Parametric density models, in particular Gaussian mixture
density models, are the most widely applied models in large
scale high dimensional density estimation because they offer
decent performance with a relatively small number of
parameters. In fact, to limit the number of parameters in
large tasks such as automatic speech recognition, one
assumes only mixtures of Gaussians with diagonal covariances.
There are standard EM algorithms to estimate the
mixture coefficients and the Gaussian means and covariances.
However, in real applications, these parametric assumptions
are often violated, and the resulting parametric density
estimates can be highly biased. For example, mixtures of
diagonal Gaussians

$$p(x) = \sum_{i=1}^{I} \pi_i \, G(x, \mu_i, \Sigma_i)$$

roughly assume that the data is clustered, and within each
cluster the dimensions are independent and Gaussian distributed.
However, in practice, the dimensions are often
correlated within each cluster. This leads to the need for
modeling the covariance of each mixture component. The
following "semi-tied" covariance has been proposed:

$$\Sigma_i = A \Lambda_i A^T$$

where $A$ is shared and for each component, $\Lambda_i$ is diagonal.
Semi-tied covariance is described by: M. J. F. Gales, in
"Semi-tied Covariance Matrices for Hidden Markov
Models", IEEE Transactions Speech and Audio Processing,
Vol. 7, pp. 272-81, May 1999; and R. A. Gopinath, in
"Constrained Maximum Likelihood Modeling with Gaussian
Distributions", Proc. of DARPA Speech Recognition
Workshop, February 8-11, Lansdowne, Va., 1998. Semi-tied
covariance has been reported in the immediately preceding
two articles to significantly improve the performance of
large vocabulary continuous speech recognition systems. It
should be appreciated that a compound Gaussian is no
longer a diagonal Gaussian.

Accordingly, there is a need for a method that transforms
high dimensional data into a standard Gaussian distribution
which is computationally efficient.

SUMMARY OF THE INVENTION

The present invention is directed to high dimensional
acoustic modeling via mixtures of compound Gaussians
with linear transforms. In addition to providing a novel
density model within an acoustic model, the present invention
also provides an iterative expectation maximization
(EM) method which estimates the parameters of the mixtures
of the density model as well as of the linear transform.

According to a first aspect of the invention, a method is
provided for generating a high dimensional density model
within an acoustic model for one of a speech and a speaker
recognition system. The density model has a plurality of
components, each component having a plurality of coordinates
corresponding to a feature space. The method includes
the step of transforming acoustic data obtained from at least
one speaker into high dimensional feature vectors. The
density model is formed to model the feature vectors by a
mixture of compound Gaussians with a linear transform.
Each compound Gaussian is associated with a compound
Gaussian prior and models each of the coordinates of each
of the components of the density model independently by a
univariate Gaussian mixture including a univariate Gaussian
prior, variance, and mean.

According to a second aspect of the invention, the method
further includes the step of applying an iterative expectation
maximization (EM) method to the feature vectors, to estimate
the linear transform, the compound Gaussian priors,
and the univariate Gaussian priors, variances, and means.

According to a third aspect of the invention, the EM
method includes the step of computing an auxiliary function 
Q of the EM method. The compound Gaussian priors and the 
univariate Gaussian priors are respectively updated, to maxi- 
mize the auxiliary function Q. The univariate Gaussian 
variances, the linear transform, and the univariate Gaussian 
means are respectively updated to maximize the auxiliary 
function Q, the linear transform being updated row by row. 
The second updating step is repeated, until the auxiliary 
function Q converges to a local maximum. The computing
step and the second updating step are repeated, until a log 
likelihood of the feature vectors converges to a local maxi- 
mum. 

According to a fourth aspect of the invention, the method 
further includes the step of updating the density model to 
model the feature vectors by the mixture of compound 
Gaussians with the updated linear transform. Each of the 
compound Gaussians is associated with one of the updated
compound Gaussian priors and models each of the coordi- 
nates of each of the components independently by the 
univariate Gaussian mixtures including the updated univari- 
ate Gaussian priors, variances, and means. 

According to a fifth aspect of the invention, the linear
transform is fixed, when the univariate Gaussian variances 
are updated. 

According to a sixth aspect of the invention, the univariate 
Gaussian variances are fixed, when the linear transform is 
updated. 

According to a seventh aspect of the invention, the linear 
transform is fixed, when the univariate Gaussian means are 
updated. 

These and other aspects, features and advantages of the 
present invention will become apparent from the following
detailed description of preferred embodiments, which is to
be read in connection with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram illustrating a method for mining
high dimensional data according to an illustrative embodi- 
ment of the present invention; 

FIG. 2 is a flow diagram illustrating a method for visualizing
high dimensional data according to an illustrative
embodiment of the present invention; 

FIG. 3 is a flow diagram illustrating steps 112 and 212 of 
FIGS. 1 and 2, respectively, in further detail according to an
illustrative embodiment of the present invention;

FIG. 4 is a flow diagram of a method for generating a high 
dimensional density model within an acoustic model for a 
speech and/or a speaker recognition system, according to an
illustrative embodiment of the invention; 

FIG. 5 is a flow diagram of a method for generating a high 
dimensional density model within an acoustic model for a 
speech and/or a speaker recognition system, according to an 
illustrative embodiment of the invention; and 

FIG. 6 is a block diagram of a computer processing 
system 600 to which the present invention may be applied 
according to an embodiment thereof. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

The present invention is directed to high dimensional 
acoustic modeling via mixtures of compound Gaussians 
with linear transforms. In addition to providing a novel 
density model within an acoustic model for a speech and/or 
speaker recognition system, the present invention also provides
an iterative expectation maximization (EM) method
which estimates the parameters of the mixtures of the
density model as well as of the linear transform.

A general description of the present invention will now be
given with respect to FIGS. 1-5 to introduce the reader to the 
concepts of the invention. Subsequently, more detailed 
descriptions of various aspects of the invention will be 
provided. 

FIG. 1 is a flow diagram illustrating a method for mining
high dimensional data according to an illustrative embodi- 
ment of the present invention. The method of FIG. 1 is an 
implementation of the iterative Gaussianization method 
mentioned above and described in further detail below. 

The log likelihood of the high dimensional data is computed
(step 110). The high dimensional data is linearly
transformed into less dependent coordinates, by applying a
linear transform of n rows by n columns to the high
dimensional data (step 112). Each of the coordinates is
marginally Gaussianized, said Gaussianization being
characterized by univariate Gaussian means, priors, and
variances (step 114).

It is then determined whether the coordinates converge to
a standard Gaussian distribution (step 116). If not, then the
method returns to step 112. As is evident, the transforming
and Gaussianizing steps (112 and 114, respectively) are
iteratively repeated until the coordinates converge to a
standard Gaussian distribution, as determined at step 116.

If the coordinates do converge to a standard Gaussian
distribution, then the coordinates of all iterations are
arranged hierarchically to facilitate data mining (step 118).
The arranged coordinates are then mined (step 120).
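
The marginal Gaussianization of step 114 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that the priors, means, and variances of one coordinate's univariate Gaussian mixture are already known, and that the coordinate is mapped through the mixture CDF and then through the inverse standard normal CDF. The function names `mixture_cdf` and `gaussianize` are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def mixture_cdf(x, priors, means, variances):
    """CDF of a univariate Gaussian mixture evaluated at x."""
    return sum(
        p * NormalDist(m, v ** 0.5).cdf(x)
        for p, m, v in zip(priors, means, variances)
    )


def gaussianize(x, priors, means, variances, eps=1e-12):
    """Map x to a value that is standard normal if x follows the mixture."""
    u = mixture_cdf(x, priors, means, variances)
    # inv_cdf is only defined on the open interval (0, 1), so clamp u.
    u = min(max(u, eps), 1.0 - eps)
    return _STANDARD_NORMAL.inv_cdf(u)
```

With a single component the transform reduces to ordinary standardization, so `gaussianize(5.0, [1.0], [3.0], [4.0])` is (5 - 3) / 2 up to rounding. With more components the map is nonlinear but still monotone in x.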

FIG. 2 is a flow diagram illustrating a method for visualizing
high dimensional data according to an illustrative
embodiment of the present invention. The method of FIG. 2
is also an implementation of the iterative Gaussianization
method mentioned above and described in further detail
below.

The log likelihood of the high dimensional data is computed
(step 210). The high dimensional data is linearly
transformed into less dependent coordinates, by applying a
linear transform of n rows by n columns to the high
dimensional data (step 212). Each of the coordinates is
marginally Gaussianized, said Gaussianization being
characterized by univariate Gaussian means, priors, and
variances (step 214).

It is then determined whether the coordinates converge to
a standard Gaussian distribution (step 216). If not, then the
method returns to step 212. As is evident, the transforming
and Gaussianizing steps (212 and 214, respectively) are
iteratively repeated until the coordinates converge to a
standard Gaussian distribution, as determined at step 216.

If the coordinates do converge to a standard Gaussian
distribution, then the coordinates of all iterations are
arranged hierarchically into high dimensional data sets to
facilitate data visualization (step 218). The high dimensional
data sets are then visualized (step 220).

FIG. 3 is a flow diagram illustrating steps 112 and 212 of
FIGS. 1 and 2, respectively, in further detail according to an
illustrative embodiment of the present invention. It is to be
appreciated that the method of FIG. 3 is an implementation
of the iterative expectation maximization (EM) method of
the invention mentioned above and described in further
detail below.

The auxiliary function Q is computed from the log 
likelihood of the high dimensional data (step 310). It is then 
determined whether this is the first iteration of the EM
method of FIG. 3 (step 312). If so, then the method proceeds 
to step 314. Otherwise, the method proceeds to step 316 
(thereby skipping step 314). 

At step 314, the univariate Gaussian priors are updated, to
maximize the auxiliary function Q. At step 316, the univariate
Gaussian variances are updated, while maintaining the
linear transform fixed, to maximize the auxiliary function Q.
The linear transform is updated row by row, while maintaining
the univariate Gaussian variances fixed, to maximize
the auxiliary function Q (step 318). The univariate Gaussian
means are updated, while maintaining the linear transform
fixed, to maximize the auxiliary function Q (step 320).
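
The row-by-row update of step 318 can be sketched as follows. This is a hedged illustration under stated assumptions, not a definitive implementation: it assumes a single linear transform A whose j-th output coordinate is modeled by one univariate Gaussian mixture, and it uses the closed form derived later in this description, in which the new row is (beta * c + h) G^{-1}, with c the cofactors of row j and beta a root of a quadratic equation. The helper names `row_statistics`, `row_objective`, and `update_row`, and the array layouts, are hypothetical.

```python
import numpy as np


def row_statistics(A, j, X, w_j, mu_j, var_j):
    """Cofactor row c and the statistics G, h for row j.

    X: (N, D) data; w_j: (N, I) posterior weights for coordinate j;
    mu_j, var_j: (I,) means and variances of coordinate j's mixture.
    """
    # Cofactors of row j equal det(A) times column j of inv(A).
    c = np.linalg.det(A) * np.linalg.inv(A)[:, j]
    s = (w_j / var_j).sum(axis=1)                 # per-sample precision weight
    G = (X * s[:, None]).T @ X                    # (D, D), symmetric
    h = (w_j * mu_j / var_j).sum(axis=1) @ X      # (D,)
    return c, G, h


def row_objective(a, c, G, h, n_samples):
    """Part of Q that depends on row a, up to an additive constant."""
    return n_samples * np.log(abs(c @ a)) - 0.5 * a @ G @ a + h @ a


def update_row(A, j, X, w_j, mu_j, var_j):
    """Return a copy of A whose j-th row maximizes Q with all else fixed."""
    n_samples = X.shape[0]
    c, G, h = row_statistics(A, j, X, w_j, mu_j, var_j)
    g_inv_c = np.linalg.solve(G, c)
    qa, qb = c @ g_inv_c, h @ g_inv_c             # quadratic coefficients
    disc = np.sqrt(qb ** 2 + 4.0 * qa * n_samples)
    best_val, best_row = -np.inf, None
    for beta in ((-qb + disc) / (2.0 * qa), (-qb - disc) / (2.0 * qa)):
        a = np.linalg.solve(G, beta * c + h)      # G symmetric: (beta c + h) G^-1
        val = row_objective(a, c, G, h, n_samples)
        if val > best_val:
            best_val, best_row = val, a
    A_new = A.copy()
    A_new[j] = best_row
    return A_new
```

Both roots of the quadratic give stationary points, one on each side of the hyperplane where the determinant vanishes, so the sketch evaluates the objective at both and keeps the better one. No step size has to be chosen.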

It is then determined whether the auxiliary function Q
converges to a local maximum (step 322). If not, then the
method returns to step 316. As is evident, the steps of
updating the univariate Gaussian variances, the linear
transform, and the univariate Gaussian means (316, 318, and
320, respectively) are iteratively repeated until the auxiliary
function Q converges to a local maximum, as determined at
step 322.

If the auxiliary function Q converges to a local maximum,
then it is determined whether the log likelihood of the high
dimensional data converges to a local maximum (step 324).
If not, the method returns to step 310. As is evident, the
computing step (310) and all of the updating steps other than
the step of updating the univariate Gaussian priors
(316-320) are iteratively repeated until the auxiliary function
Q converges to a local maximum (as determined at step
322) and the log likelihood of the high dimensional data
converges to a local maximum (as determined at step 324).
If the log likelihood of the high dimensional data converges
to a local maximum, then the EM method is terminated.
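
The prior, variance, and mean updates of steps 314, 316, and 320 can be sketched as weighted averages. This is an illustrative sketch under assumptions that the text above does not state explicitly: `Y` holds the linearly transformed data, `w` holds the posterior weight of each univariate Gaussian for each sample and coordinate, and every coordinate uses the same number of Gaussians. All names and array shapes are hypothetical.

```python
import numpy as np


def update_priors(w):
    """w: (N, D, I) posterior weights. Returns priors of shape (D, I)."""
    return w.mean(axis=0)


def update_variances(Y, w, mu):
    """Variances maximizing Q with the transform and the means held fixed.

    Y: (N, D) transformed data; mu: (D, I) current means.
    """
    sq = (Y[:, :, None] - mu[None]) ** 2
    return (w * sq).sum(axis=0) / w.sum(axis=0)


def update_means(Y, w):
    """Means maximizing Q with the transform held fixed."""
    return (w * Y[:, :, None]).sum(axis=0) / w.sum(axis=0)
```

A Gaussian that receives zero total weight would cause a division by zero here; a practical implementation would need to guard against that case.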

FIG. 4 is a flow diagram of a method for generating a high
dimensional density model within an acoustic model for a
speech and/or a speaker recognition system, according to an
illustrative embodiment of the invention. The density model
has a plurality of components, each component having a
plurality of coordinates corresponding to a feature space. It
is to be noted that the method of FIG. 4 is directed to
forming a density model having a novel structure as further
described below.

Acoustic data obtained from at least one speaker is
transformed into high dimensional feature vectors (step
410). The density model is formed to model the high
dimensional feature vectors by a mixture of compound
Gaussians with a linear transform (step 412). Each of the
compound Gaussians of the mixture is associated with a
compound Gaussian prior and models each coordinate of
each component of the density model independently by
univariate Gaussian mixtures. Each univariate Gaussian
mixture comprises a univariate Gaussian prior, variance, and
mean.
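
One way to read the model structure of step 412 is sketched below. This is an interpretation, not a formula quoted from the text: it assumes the density of a feature vector x is |det A| times a prior-weighted sum over compound Gaussians, where each compound Gaussian is a product over coordinates of a univariate Gaussian mixture evaluated at y = A x. The function name, the argument names, and the use of the same number of univariate Gaussians for every coordinate are assumptions made for illustration.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def compound_mixture_logpdf(x, A, rho, pi, mu, var):
    """Log density of one feature vector.

    x: (D,) feature vector; A: (D, D) linear transform;
    rho: (K,) compound Gaussian priors;
    pi, mu, var: (K, D, I) univariate priors, means, and variances.
    """
    y = A @ x
    # Log of prior times univariate Gaussian density, per (k, d, i).
    log_terms = (np.log(pi) - 0.5 * np.log(2.0 * np.pi * var)
                 - 0.5 * (y[None, :, None] - mu) ** 2 / var)
    log_coord = _logsumexp(log_terms, axis=2)            # (K, D)
    log_compound = np.log(rho) + log_coord.sum(axis=1)   # (K,)
    _, log_abs_det = np.linalg.slogdet(A)
    return log_abs_det + _logsumexp(log_compound, axis=0)
```

With one compound Gaussian, one univariate Gaussian per coordinate, and an identity transform, the sketch reduces to a standard diagonal Gaussian, which gives a simple sanity check.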

FIG. 5 is a flow diagram of a method for generating a high
dimensional density model within an acoustic model for a
speech and/or a speaker recognition system, according to
another illustrative embodiment of the present invention.
The density model has a plurality of components, each
component having a plurality of coordinates corresponding
to a feature space. It is to be noted that the method of FIG.
5 is directed to forming a density model having a novel
structure as per FIG. 4 and, further, to estimating the
parameters of the density model. The method of FIG. 5 is an
implementation of the iterative expectation maximization
method of the invention mentioned above and described in
further detail below.
Acoustic data obtained from at least one speaker is
transformed into high dimensional feature vectors (step
510). The density model is formed to model the feature
vectors by a mixture of compound Gaussians with a linear
transform (step 512). Each compound Gaussian is associated
with a compound Gaussian prior. Also, each compound
Gaussian models each coordinate of each component of the
density model independently by a univariate Gaussian
mixture. Each univariate Gaussian mixture comprises a
univariate Gaussian prior, variance, and mean.

An iterative expectation maximization (EM) method is 
applied to the feature vectors (step 514). The EM method 
includes steps 516-528. At step 516, an auxiliary function of 
the EM method is computed from the log likelihood of the 
high dimensional feature vectors. It is then determined 
whether this is the first iteration of the EM method of FIG. 
5 (step 518). If so, then the method proceeds to step 520. 
Otherwise, the method proceeds to step 524 (thereby skip- 
ping steps 520 and 522). 

At step 520, the compound Gaussian priors are updated, 
to maximize the auxiliary function Q. At step 522, the 
univariate Gaussian priors are updated, to maximize the 
auxiliary function Q. 

At step 524, the univariate Gaussian variances are
updated, while maintaining the linear transform fixed, to
maximize the auxiliary function Q. The linear transform is
updated row by row, while maintaining the univariate
Gaussian variances fixed, to maximize the auxiliary function Q
(step 526). The univariate Gaussian means are updated,
while maintaining the linear transform fixed, to maximize
the auxiliary function Q (step 528).

It is then determined whether the auxiliary function Q
converges to a local maximum (step 530). If not, then the
method returns to step 524. As is evident, the steps of
updating the univariate Gaussian variances, the linear
transform, and the univariate Gaussian means (524, 526, and
528, respectively) are iteratively repeated until the auxiliary
function Q converges to a local maximum, as determined at
step 530.

If the auxiliary function Q converges to a local maximum,
then it is determined whether the log likelihood of the high
dimensional feature vectors converges to a local maximum
(step 532). If not, the method returns to step 516. As is
evident, the computing step (516) and all of the updating
steps other than the steps of updating the compound and
univariate Gaussian priors (524-528) are iteratively
repeated until the auxiliary function Q converges to a local
maximum (as determined at step 530) and the log likelihood
of the high dimensional feature vectors converges to a local
maximum (as determined at step 532).

If the log likelihood of the high dimensional feature
vectors converges to a local maximum, then the density
model is updated to model the feature vectors by the mixture
of compound Gaussians with the updated linear transform
(step 534). Each compound Gaussian is associated with an
updated compound Gaussian prior and models each coordinate
of each component independently by the univariate
Gaussian mixtures. Each univariate Gaussian mixture
comprises an updated univariate Gaussian prior, variance, and
mean.

More detailed descriptions of various aspects of the 
present invention will now be provided. The present inven- 
tion provides a method that transforms multi-dimensional 
random variables into standard Gaussian random variables. 
For a random variable $X \in R^n$, the Gaussianization transform
T is defined as an invertible transformation of X such that

$$T(X) \sim N(0, I).$$

The Gaussianization transform corresponds to a density
estimation

$$p(x) = \left| \det \frac{\partial T(x)}{\partial x} \right| \, G(T(x), 0, I).$$

The Gaussianization method of the invention is iterative
and converges. Each iteration is parameterized by univariate
parametric density estimates and can be efficiently solved by
an expectation maximization (EM) algorithm. Since the
Gaussianization method of the invention involves univariate
density estimates only, it has the potential to overcome the
curse of dimensionality. Some of the features of the
Gaussianization method of the invention include the following:

(1) The Gaussianization method of the invention can be
viewed as a parametric generalization of Friedman's
exploratory projection pursuit and can be easily
accommodated to perform arbitrary d-dimensional projection
pursuit.

(2) Each of the iterations of the Gaussianization method
of the invention can be viewed as solving a problem of
independent component analysis by maximum likelihood.
Thus, according to an embodiment of the
invention, an EM method is provided that is computationally
more attractive than the noiseless independent
factor analysis algorithm described by H. Attias, in
"Independent Factor Analysis", Neural Computation,
Vol. 11, pp. 803-51, May 1999.

(3) The probabilistic models of the invention can be
viewed as a generalization of the mixtures of diagonal
Gaussians with explicit description of the non-Gaussianity
of each dimension and the dependency
among the dimensions.

(4) The Gaussianization method of the present invention
induces independent structures which can be arranged
hierarchically as a tree.

In the problem of classification, we are often interested in
transforming the original feature to obtain more discriminative
features. To this end, most of the focus of the prior art
has been on linear transforms such as: linear discriminant
analysis (LDA); maximum likelihood linear transform
(MLLT); and semi-tied covariances. In contrast, the present
invention provides nonlinear feature extraction based on
Gaussianization. Efficient EM type methods are provided
herein to estimate such nonlinear transforms.

A description of Gaussianization will now be given. For
a random variable $X \in R^n$, we define its Gaussianization
transform to be an invertible and differentiable transform
$T(X)$ such that the transformed variable $T(X)$ follows the
standard Gaussian distribution:

$$T(X) \sim N(0, I).$$

Naturally, the following questions arise. Does a
Gaussianization transform exist? If so, is it unique? Moreover, how
can one construct a Gaussianization transformation from
samples of the random variable X? As is shown below with
minor regularity assumptions on the probability density of
X, a Gaussianization transform exists and the transform is
not unique. The constructive algorithm, however, is not
amenable to being used in practice on samples of X. For
estimation of the Gaussianization transform from sample
data (i.e., given i.i.d. observations $\{X_i : 1 \le i \le L\}$ and
regularity conditions on the probability density function of X,
viz., that it is strictly positive and continuously
differentiable), an iterative Gaussianization method is provided
wherein, at each iteration, a maximum-likelihood
parameter estimation problem is solved. Let $\phi(x)$ be the
probability density function of the standard normal

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

and let $\Phi(x)$ be the cumulative distribution function (CDF)
of the standard normal

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du.$$

A description of one dimensional Gaussianization will
now be given. Let us first consider the univariate case: $X \in R^1$.
Let $F(X)$ be the cumulative distribution function of X:

$$F(x) = P(X \le x).$$

It can be easily verified that

$$\Phi^{-1}(F(X)) \sim N(0, 1). \qquad (1)$$

In practice, the CDF $F(X)$ is not available; it has to be
estimated from the training data. According to an embodiment
of the invention, it is approximated by Gaussian
mixture models

$$p(x) = \sum_{i=1}^{I} \pi_i \, G(x, \mu_i, \sigma_i^2),$$

i.e., we assume the CDF

$$F(x) = \sum_{i=1}^{I} \pi_i \, \Phi\left(\frac{x - \mu_i}{\sigma_i}\right).$$

Therefore, we parameterize the Gaussianization transform
as

$$T(x) = \Phi^{-1}\left( \sum_{i=1}^{I} \pi_i \, \Phi\left(\frac{x - \mu_i}{\sigma_i}\right) \right)$$

where the parameters $\{\pi_i, \mu_i, \sigma_i\}$ can be estimated via
maximum likelihood using the standard EM algorithm.

A description of the existence of high dimensional
Gaussianization will now be given. For any random variable
$X \in R^n$, the Gaussianization transform can be constructed
theoretically as a sequence of n one dimensional
Gaussianizations. Let $X^{(0)} = X$. Let $p^{(0)}(x_1^{(0)}, \ldots, x_n^{(0)})$ be the
probability density function. First, we Gaussianize the first
coordinate $X_1^{(0)}$:

$$X_1^{(1)} = T_1(X_1^{(0)}) = \Phi^{-1}\left(F_{X_1^{(0)}}(X_1^{(0)})\right)$$

where $F_{X_1^{(0)}}$ is the marginal CDF of $X_1^{(0)}$

$$F_{X_1^{(0)}}(x) = P(X_1^{(0)} \le x).$$

The remaining coordinates are left unchanged

$$X_d^{(1)} = X_d^{(0)}, \quad d = 2, \ldots, n.$$

Let $p^{(1)}(x_1^{(1)}, \ldots, x_n^{(1)})$ be the density of the transformed
variable $X^{(1)}$. Clearly,
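
$$p^{(1)}(x_1^{(1)}, \ldots, x_n^{(1)}) = \phi(x_1^{(1)}) \, p^{(1)}(x_2^{(1)}, \ldots, x_n^{(1)} \mid x_1^{(1)}).$$

We can then Gaussianize the conditional distribution
$p^{(1)}(x_2^{(1)} \mid x_1^{(1)})$:

$$X_2^{(2)} = T_{x_1^{(1)}}(X_2^{(1)}) = \Phi^{-1}\left(F_{X_2^{(1)} \mid X_1^{(1)}}(X_2^{(1)})\right)$$

where $F_{X_2^{(1)} \mid X_1^{(1)}}$ is the CDF of the conditional density
$p^{(1)}(x_2^{(1)} \mid x_1^{(1)})$. The remaining coordinates are left
unchanged:

$$X_d^{(2)} = X_d^{(1)}, \quad d = 1, 3, \ldots, n.$$

Let $p^{(2)}(x_1^{(2)}, \ldots, x_n^{(2)})$ be the density of the transformed
variable $X^{(2)}$:

$$p^{(2)}(x_1^{(2)}, \ldots, x_n^{(2)}) = \phi(x_1^{(2)}) \, \phi(x_2^{(2)}) \, p^{(2)}(x_3^{(2)}, \ldots, x_n^{(2)} \mid x_1^{(2)}, x_2^{(2)}).$$

Then, we can Gaussianize the conditional density
$p^{(2)}(x_3^{(2)} \mid x_1^{(2)}, x_2^{(2)})$ and so on. After n steps, we obtain the
transformed variable $X^{(n)}$ which is standard Gaussian:

$$p^{(n)}(x_1^{(n)}, \ldots, x_n^{(n)}) = \phi(x_1^{(n)}) \cdots \phi(x_n^{(n)}).$$

The above construction is not practical (i.e., we cannot
apply it easily when we have sample data from the original
random variable X) since at the (k+1)-th step, it requires the
conditional density $p^{(k)}(x_{k+1}^{(k)} \mid x_1^{(k)}, \ldots, x_k^{(k)})$ for all possible
$x_1^{(k)}, \ldots, x_k^{(k)}$, which is extremely difficult given finite
sample points. It is clear that the Gaussianization transform
is not unique since the construction above could have used
any other ordering of the coordinates. Advantageously, one
embodiment of the invention provides a novel, iterative
Gaussianization method that is practical and converges;
proof of the former is provided below.

A description of Gaussianization with independent
component analysis assumption will now be given. Let
$X = (x_1, \ldots, x_D)^T$ be the high dimensional random variable
to be Gaussianized. If we assume that the individual
dimensions are independent, i.e.,

$$p(x) = \prod_{d=1}^{D} p(x_d),$$

we can then simply Gaussianize each dimension by the
univariate Gaussianization and obtain the global
Gaussianization.

However, the above independence assumption is rarely
valid in practice. Thus, we can relax the assumption by using
the following independent component analysis assumption:
assume that there exists a linear transform $A$ such that the
transformed variable

$$Y = AX$$

has independent components:

$$p(y) = \prod_{d=1}^{D} p(y_d).$$

Therefore, we can first find the linear transformation A,
and then Gaussianize each individual dimension of Y via
univariate Gaussianization. The linear transform A can be
recovered by independent component analysis, as described
by Bell et al., in "An Information-Maximization Approach
to Blind Separation and Blind Deconvolution", Neural
Computation, Vol. 7, pp. 1004-34, November 1999.

As in the univariate case, we model each dimension of Y
by a mixture of univariate Gaussians

$$p(y_d) = \sum_{i=1}^{I_d} \pi_{d,i} \, G(y_d, \mu_{d,i}, \sigma_{d,i}^2)$$

and parametrize the Gaussianization transform as

$$t_d = \Phi^{-1}\left( \sum_{i=1}^{I_d} \pi_{d,i} \, \Phi\left(\frac{y_d - \mu_{d,i}}{\sigma_{d,i}}\right) \right).$$

The parameters $\theta = (A, \pi_{d,i}, \mu_{d,i}, \sigma_{d,i})$ can be efficiently
estimated by maximum likelihood via EM. Let
$\{x_n \in R^D : 1 \le n \le N\}$ be the training set. Let
$\{y_n = A x_n : 1 \le n \le N\}$ be the transformed data. Let
$\{y_n \in R^D, z_n \in N^D : 1 \le n \le N\}$ be the complete data, where
$z_n = (z_{n,1}, \ldots, z_{n,D})$ indicates the index of the Gaussian
component along each dimension. It is clear that

$$z_{n,d} \sim \mathrm{Multinomial}(1, (\pi_{d,1}, \ldots, \pi_{d,I_d})),$$

$$y_{n,d} \mid z_{n,d} = i \sim N(\mu_{d,i}, \sigma_{d,i}^2),$$

and the coordinates $y_{n,1}, \ldots, y_{n,D}$ are independent.

We convert $z_{n,d}$ into binary variables $\{(z_{n,d,1}, \ldots, z_{n,d,I_d})\}$:

$$z_{n,d,i} = 1 \text{ if } z_{n,d} = i, \text{ and } 0 \text{ otherwise}.$$

Therefore,

$$p(x_n, z_n \mid \theta) = |\det A| \prod_{d=1}^{D} \prod_{i=1}^{I_d} \left[ \pi_{d,i} \, G(y_{n,d}, \mu_{d,i}, \sigma_{d,i}^2) \right]^{z_{n,d,i}}.$$

We obtain the following complete log likelihood

$$L(x, z, \theta) = N \log |\det A| + \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} z_{n,d,i} \left[ \log \pi_{d,i} + \log G(y_{n,d}, \mu_{d,i}, \sigma_{d,i}^2) \right]$$

where $\theta = (A, \pi, \mu, \sigma)$.

In the E-step, we compute the auxiliary function

$$Q(\theta, \hat{\theta}) = E\left( L(x, z, \theta) \mid x, \hat{\theta} \right).$$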
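
The expectation in the E-step reduces to the posterior expectations of the binary indicators, which can be sketched as follows. This is a hedged sketch rather than the patented procedure: it assumes that, because the coordinates of the transformed data are modeled independently, the posterior weight of Gaussian i along coordinate d is proportional to its prior times its univariate Gaussian density at that coordinate. The function name `e_step_weights` and the array layout (the same number of Gaussians per coordinate) are assumptions.

```python
import numpy as np


def e_step_weights(X, A, pi, mu, var):
    """Posterior weights of each univariate Gaussian for each sample.

    X: (N, D) training vectors; A: (D, D) current linear transform;
    pi, mu, var: (D, I) current priors, means, and variances.
    Returns the transformed data Y (N, D) and weights w (N, D, I).
    """
    Y = X @ A.T                                    # y_n = A x_n for every n
    log_w = (np.log(pi)[None] - 0.5 * np.log(2.0 * np.pi * var)[None]
             - 0.5 * (Y[:, :, None] - mu[None]) ** 2 / var[None])
    log_w -= log_w.max(axis=2, keepdims=True)      # stabilize before exp
    w = np.exp(log_w)
    w /= w.sum(axis=2, keepdims=True)              # normalize over Gaussians
    return Y, w
```

For each sample and coordinate the weights sum to one, and a point lying close to one mean and far from the others receives nearly all of its weight from that Gaussian.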

In the M-step, we maximize the auxiliary function Q to
update the parameters $\theta$ as

$$\theta = \arg\max_{\theta} Q(\theta, \hat{\theta}).$$

From the first order conditions on $(\pi, \mu, \sigma)$, we have

$$\pi_{d,i} = \frac{1}{N} \sum_{n=1}^{N} \omega_{n,d,i}, \qquad
\mu_{d,i} = \frac{\sum_{n=1}^{N} \omega_{n,d,i} \, y_{n,d}}{\sum_{n=1}^{N} \omega_{n,d,i}}, \qquad
\sigma_{d,i}^2 = \frac{\sum_{n=1}^{N} \omega_{n,d,i} \, (y_{n,d} - \mu_{d,i})^2}{\sum_{n=1}^{N} \omega_{n,d,i}} \qquad (3)$$

where $\omega_{n,d,i} = E(z_{n,d,i} \mid x_n, \hat{\theta})$.

Let $e_d$ be the column vector which is 1 at the position d:

$$e_d = (0, \ldots, 0, 1, 0, \ldots, 0)^T.$$

We obtain the gradient of the auxiliary function with
respect to A:

$$\frac{\partial Q}{\partial A} = N (A^{-1})^T - \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} \omega_{n,d,i} \, \frac{y_{n,d} - \mu_{d,i}}{\sigma_{d,i}^2} \, e_d x_n^T. \qquad (4)$$

Notice that this gradient is nonlinear in A. Therefore,
solving the first order equations is nontrivial and it requires
iterative techniques to optimize the auxiliary function Q.

In the prior art, gradient descent was used to optimize Q,
as described by H. Attias, in "Independent Factor Analysis",
Neural Computation, Vol. 11, pp. 803-51, May 1999.
According to the prior art, the following steps were
performed at each iteration:

(1) Fix A and compute the Gaussian mixture parameters
$(\pi, \mu, \sigma)$ by (3).

(2) Fix the Gaussian mixture parameters $(\pi, \mu, \sigma)$ and
update A via gradient descent using the natural gradient:

$$A \leftarrow A + \eta \, \frac{\partial Q}{\partial A} A^T A \qquad (5)$$

where $\eta > 0$ determines the learning rate. The natural
gradient is described by S. Amari, in "Natural Gradient
Works Efficiently in Learning", Neural Computation,
Vol. 10, pp. 251-76, 1998.

According to an embodiment of the present invention, an
iterative gradient descent method will now be derived which
does not involve the nuisance of determining the stepsize
parameter $\eta$. At each iteration of the iterative gradient
descent method of the invention, for each of the rows of the
matrix A:

(1) Fix A and compute the Gaussian mixture parameters
$(\pi, \mu, \sigma)$ by (3).

(2) Update each row of A with all the other rows of A and
the Gaussian mixture parameters $(\pi, \mu, \sigma)$ fixed. Let $a_j$
be the j-th row; let $c_j = (c_1, \ldots, c_D)$ be the co-factors of
A associated with the j-th row. Note that $a_j$ and $c_j$ are
both row vectors. Then

$$\frac{\partial Q}{\partial a_j} = N \frac{c_j}{c_j a_j^T} - a_j G_j + h_j$$

where

$$G_j = \sum_{i=1}^{I_j} \frac{1}{\sigma_{j,i}^2} \sum_{n=1}^{N} \omega_{n,j,i} \, x_n x_n^T, \qquad
h_j = \sum_{i=1}^{I_j} \frac{\mu_{j,i}}{\sigma_{j,i}^2} \sum_{n=1}^{N} \omega_{n,j,i} \, x_n^T.$$

Therefore,

$$\frac{\partial Q}{\partial a_j} = 0 \iff a_j G_j = \beta c_j + h_j, \qquad \beta = \frac{N}{c_j a_j^T}.$$
So we have $a_j = (\beta c_j + h_j)\,G_j^{-1}$ and

$$\beta^{2}\,(c_j G_j^{-1} c_j^{T}) + \beta\,(h_j G_j^{-1} c_j^{T}) - N = 0. \qquad (6)$$

Therefore, the j-th row of A can be updated as

$$a_j = (\beta c_j + h_j)\,G_j^{-1}$$

where β can be solved from the quadratic equation (6).

It is to be appreciated that the iterative Gaussianization method of the present invention increases the auxiliary function Q at every iteration. It would seem that after updating a particular row $a_j$, we are required to go back and update the Gaussian mixture parameters (π, μ, σ) to guarantee improvement on Q. However, that is not necessary because of the following two observations:

(1) The update on $a_j$ depends only on $(\pi_{j,i}, \mu_{j,i}, \sigma_{j,i})$ but not on $(\pi_{k,i}, \mu_{k,i}, \sigma_{k,i} : k \ne j)$.

(2) The update on $(\pi_{j,i}, \mu_{j,i}, \sigma_{j,i})$ depends only on $a_j$ but not on $(a_k : k \ne j)$.

Therefore, it is equivalent that we update A, row by row, with the Gaussian parameters fixed and update all the Gaussian mixture parameters in the next iteration.

An EM algorithm for the same estimation problem, referred to as the noiseless independent factor analysis, was described by H. Attias, in "Independent Factor Analysis", Neural Computation, Vol. 11, pp. 803-51, May 1999. The M-step in the algorithm described by Attias involves gradient descent based on natural gradient. In contrast, our M-step involves closed form solution of the rows of A; that is, there is no gradient descent. Since the iterative Gaussianization method of the invention advantageously increases the auxiliary function Q at every iteration, it converges to a local maximum.

A description of iterative Gaussianization transforms for arbitrary random variables according to various embodiments of the present invention will now be given. Convergence results are also provided.

The iterative Gaussianization method of the invention is based on a series of univariate Gaussianization. We parameterize univariate Gaussianization via mixtures of univariate Gaussians, as described above. In our analysis we do not consider the approximation error of univariate variables by mixtures of univariate Gaussians. In other words, we assume that the method uses the ideal univariate Gaussianization transform rather than its parametric approximation using Gaussian mixtures.

To define and analyze the iterative Gaussianization method of the invention, we first establish some notations and assumptions.

We define the distance between two random variables X and Y to be the Kullback-Leibler distance between their density functions

$$D(X\,\|\,Y) = D(p_X\,\|\,p_Y) = \int p_X(x)\,\log\frac{p_X(x)}{p_Y(x)}\,dx$$

where $p_X$ and $p_Y$ are the densities of X and Y respectively. In particular, we define the negentropy of a random variable X as the distance between X and the standard Gaussian variable N(0, I):

$$J(X) = D(X\,\|\,N(0, I)).$$

It is to be noted that we are taking some slight liberty with the terminology; normally the negentropy of a random variable is defined to be the Kullback-Leibler distance between itself and the Gaussian variable with the same mean and covariance.

In traditional information theory, mutual information is defined to describe the dependency between two random variables. According to the invention, the mutual information of one random variable X is defined to describe the dependence among the dimensions:

$$I(X) = \int p_X(x_1, \ldots, x_n)\,\log\frac{p_X(x_1, \ldots, x_n)}{p_{x_1}(x_1)\cdots p_{x_n}(x_n)}\,dx_1 \cdots dx_n$$

where $p_{x_d}(x_d)$ is the marginal density of $x_d$. Clearly I(X) is always nonnegative, and I(X) = 0 if and only if $x_1, \ldots, x_n$ are mutually independent.

A description of six assumptions relating to negentropy and mutual information will now be given, followed by a proof thereof.

First, negentropy can be decomposed as the sum of marginal negentropies and the mutual information. Thus, for purposes of the invention, assume that for any random variable $X = (x_1, \ldots, x_n)^T$,

$$J(X) = \sum_{d=1}^{n} J(x_d) + I(X).$$

We shall call the sum of negentropies of each dimension the marginal negentropy

$$J_M(X) = \sum_{d=1}^{n} J(x_d).$$

Second, negentropy is invariant under orthogonal linear transforms. Thus, for the purposes of the invention, assume that for any random variable $X \in R^n$ and orthogonal matrix Θ,

$$J(\Theta X) = J(X).$$

This easily follows from the fact that the Kullback-Leibler distance is invariant under invertible transforms:

$$J(\Theta X) = D(\Theta X\,\|\,N(0, I)) = D(\Theta X\,\|\,\Theta N(0, I)) = D(X\,\|\,N(0, I)) = J(X).$$

Third, let Ψ be the ideal marginal Gaussianization operator which ideally Gaussianizes each dimension. Thus, for the purposes of the invention, assume that for any random variable X,

$$J_M(\Psi(X)) = 0, \qquad J(\Psi(X)) = I(X).$$

Fourth, mutual information is invariant under any invertible dimension-wise transforms. Thus, for the purposes of the invention, let any random variable $X = (x_1, \ldots, x_n)^T$; we transform X by invertible dimension-wise transforms $f_d : R^1 \to R^1$, $y_d = f_d(x_d)$; let $Y = (y_1, \ldots, y_n)^T$ be the transformed variable. Then

$$I(X) = I(Y).$$

Fifth, the $L^1$ distance between two densities is bounded from above by the square root of the Kullback-Leibler distance. Thus, for the purposes of the invention, let f(x) and g(x) be two n-dimensional density functions, then



where the parameters θ = (A, π, μ, σ) and $\Psi_{\pi,\mu,\sigma}$ is the marginal Gaussianization operator parameterized by univariate Gaussian mixture models that Gaussianizes each individual dimension as per equation (2). Clearly, when the number of Gaussian components goes to infinity, we achieve perfect Gaussianization



$$\int |f(x) - g(x)|\,dx \le [2D(f\,\|\,g)]^{1/2}.$$



10 



$$\lim J(T_\theta(X)) = 0.$$



Sixth, a weak convergence result is proved on random variables, similar to Proposition 14.2 described by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. Thus, for the purposes of the invention, let X and ($X^{(1)}, X^{(2)}, \ldots$) be random variables in $R^n$; then

$$\sup_{\|a\|_2 = 1} D(a^{T}X^{(k)}\,\|\,a^{T}X) \to 0 \quad (k \to \infty)$$

implies

$$X^{(k)} \to X \ \text{weakly}.$$

Regarding the proof of the sixth assumption above, let $p^{(k)}$, $p_a^{(k)}$, $p$ and $p_a$ be the density functions of $X^{(k)}$, $a^{T}X^{(k)}$, X and $a^{T}X$, respectively. By the fifth assumption, we have:

$$\int |p_a^{(k)}(t) - p_a(t)|\,dt \le [2D(p_a^{(k)}\,\|\,p_a)]^{1/2}.$$

For the arbitrary random variable X, $T_\theta$ does not necessarily achieve Gaussianization. However, we can still run the same maximum likelihood EM algorithm. In fact, if we iteratively apply $T_\theta$, we can prove that $T_\theta$ achieves Gaussianization. More specifically, let $X^{(0)} = X$; let

$$X^{(k+1)} = T_{\theta_k}(X^{(k)}) \qquad (8)$$

where the parameters $\theta_k$ are estimated from the data associated with the k-th generation $X^{(k)}$. We shall prove that $X^{(k)}$ converges to the standard normal distribution.

A description of the relationship of iterative Gaussianization to maximum likelihood will now be given. We first show that the maximum likelihood EM method of the invention described above actually minimizes the negentropy distance



30 



$$\min_\theta J(T_\theta(X)), \qquad (9)$$



Let $X, Y \in R^n$ be two random variables; let $T_\theta : R^n \to R^n$ be a parameterized invertible transform with parameters θ. Then, finding the best transform

$$\min_\theta D(T_\theta(X)\,\|\,Y)$$

is equivalent to maximizing a likelihood function which is parameterized by θ:

Hence, the characteristic functions $\psi_a^{(k)}$ of $p_a^{(k)}$ converge uniformly to the characteristic function $\psi_a$ of $p_a$:

Let $\Psi^{(k)}$ and $\Psi$ be the characteristic functions of the joint densities $p^{(k)}$ and p, respectively. The marginal characteristic functions can be related to the joint characteristic functions as

Therefore,



afi 0.0 

In particular, by setting θ = 1,

a 

i.e., the characteristic functions of $X^{(k)}$ converge uniformly to the characteristic function of X. Therefore $X^{(k)}$ converges weakly to X, by the continuity theorem of characteristic functions, i.e., the densities $p^{(k)}$ converge to p pointwise at every continuous point of p.

A description of the iterative Gaussianization method of the present invention will now be given. Let $X = (x_1, \ldots, x_n)^T$ be an arbitrary random variable. We intend to find an invertible transform $T : R^n \to R^n$ such that T(X) is distributed as the standard Gaussian, i.e., we would like to find T such that

$$J(T(X)) = 0. \qquad (7)$$
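By way of a non-limiting illustration, the one-dimensional case of such a transform may be sketched as follows. The sketch assumes the univariate Gaussianization takes the usual form T(x) = Φ⁻¹(F(x)), where F is the cumulative distribution function of a univariate Gaussian mixture and Φ is the standard normal cumulative distribution function; the function names and the mixture parameters are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)


def mixture_cdf(x, priors, means, sigmas):
    """CDF of a univariate Gaussian mixture sum_i pi_i N(mu_i, sigma_i^2)."""
    return sum(p * NormalDist(m, s).cdf(x)
               for p, m, s in zip(priors, means, sigmas))


def gaussianize(x, priors, means, sigmas):
    """Map x to Phi^{-1}(F(x)), with F the mixture CDF and Phi the N(0,1) CDF."""
    u = mixture_cdf(x, priors, means, sigmas)
    return STANDARD_NORMAL.inv_cdf(u)


# With a single component the transform reduces to (x - mu) / sigma.
z_single = gaussianize(3.0, [1.0], [1.0], [2.0])

# A bimodal mixture: the transform is monotone, so order is preserved.
bimodal = ([0.4, 0.6], [-2.0, 1.5], [0.5, 1.0])
zs = [gaussianize(x, *bimodal) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```

If x is in fact distributed according to the mixture, F(x) is uniform on (0, 1) and the output is standard normal, which is the sense in which the negentropy of the transformed variable vanishes.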

When X is assumed to have independent components after certain linear transforms, we have constructed the Gaussianization transform above via the EM algorithm:



45 

where the density model $p_\theta(x)$ is the density of $T_\theta^{-1}(Y)$. This easily follows from

50 

A convergence proof will now be given regarding the 
iterative Gaussianization method of the invention. We now 
analyze the Gaussianization process 

where the linear transform A and the marginal Gaussianization parameters {π, μ, σ} are obtained by minimizing the negentropy of $X^{(k+1)}$. For the sake of simplicity, we assume that we can achieve perfect univariate Gaussianization for any univariate random variable. In fact, when the number of Gaussians goes to infinity, it can be shown that the univariate Gaussianization parameterized by mixtures of univariate Gaussians above indeed converges. Therefore, in our convergence analysis, we do not consider the approximation error of univariate variables by mixtures of univariate Gaussians. We assume the perfect marginal Gaussianization operator Ψ is available which does not need to be estimated. We shall analyze the ideal iterative Gaussianization procedure

$$X^{(k+1)} = \Psi(AX^{(k)})$$

where the linear transform A is obtained by minimizing the negentropy of $X^{(k+1)}$. Obviously, if the ideal iterative procedure converges, the actual iterative procedure also converges, when the number of univariate Gaussians goes to infinity.

From the first and third assumptions above, it is clear that the negentropy of $X^{(k+1)}$ is

$$J(X^{(k+1)}) = I(AX^{(k)}).$$

Therefore, the iterative Gaussianization process attempts to linearly transform the current variable $X^{(k)}$ to the most independent view:

$$\min_A J(\Psi(AX^{(k)})) \iff \min_A I(AX^{(k)}).$$

We now prove the convergence of the iterative Gaussianization method of the invention. Our proof resembles the proof of convergence for projection pursuit density estimates described by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. Let $\Delta^{(k)}$ be the reduction in the negentropy in the k-th iteration:

$$\Delta^{(k)} = J(X^{(k)}) - J(X^{(k+1)}).$$

Since $\{J(X^{(k)})\}$ is a monotonically decreasing sequence and bounded from below by 0, we have

$$\lim_{k \to \infty} \Delta^{(k)} = 0.$$

In fact, for any given ε > 0, it takes at most $k = J(X^{(0)})/\epsilon$ iterations to reach a variable $X^{(k)}$ such that $\Delta^{(k)} \le \epsilon$.
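The iteration bound can be seen from a short telescoping argument. The following worked step is a sketch, assuming $\Delta^{(j)}$ denotes the reduction $J(X^{(j)}) - J(X^{(j+1)})$ and that negentropy is nonnegative:

```latex
% Suppose \Delta^{(j)} > \epsilon for every j = 0, \dots, k-1. Then
\begin{align*}
k\,\epsilon \;<\; \sum_{j=0}^{k-1} \Delta^{(j)}
  \;=\; J(X^{(0)}) - J(X^{(k)})
  \;\le\; J(X^{(0)}),
\end{align*}
% so k < J(X^{(0)}) / \epsilon. Equivalently, some iteration with index
% at most J(X^{(0)}) / \epsilon must satisfy \Delta^{(j)} \le \epsilon.
```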

Following the argument presented in the immediately preceding article by Huber, the maximum marginal negentropy is defined as

$$J^*(X) = \sup_{\|a\|_2 = 1} J(a^{T}X).$$

Clearly, $J(X^{(k)}) \to 0$ implies $J^*(X^{(k)}) \to 0$. However, the reverse is not necessarily true. We shall show that the maximum marginal negentropy of $X^{(k)}$ is bounded from above by $\Delta^{(k)}$. For any unit vector a ($\|a\|_2 = 1$), let $U_a$ be an orthogonal completion of a.

Applying the first and second assumptions above, we have
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Therefore, 

11^1 Ua eitinrayA 

5 

i.e., 

Since $\Delta^{(k)} \to 0$, we have

$$J^*(X^{(k)}) \to 0.$$

Applying the sixth assumption, we can now establish our convergence result.
For iterative Gaussianization as in equation (8),

$$X^{(k)} \to N(0, I)$$

in the sense of weak convergence, i.e., the density function of $X^{(k)}$ converges pointwise to the density function of the standard normal.
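By way of a non-limiting illustration, the alternation of a linear transform with marginal Gaussianization may be sketched on two-dimensional toy data. The sketch deliberately swaps in two stand-ins that are not the method of the invention: a fixed schedule of rotations replaces the linear transform A estimated by EM, and a rank-based map to normal quantiles replaces the mixture-based marginal Gaussianization operator. All names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)


def marginal_gaussianize(column):
    """Rank-based stand-in: map empirical quantiles to N(0,1) quantiles."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = STANDARD_NORMAL.inv_cdf((rank + 0.5) / n)
    return out


def rotate(points, angle):
    """Apply an orthogonal (rotation) transform to 2-D points."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * y, -s * x + c * y) for x, y in points]


def iterate_gaussianization(points, angles):
    """Alternate a linear transform with dimension-wise Gaussianization."""
    for angle in angles:
        xs, ys = zip(*rotate(points, angle))
        points = list(zip(marginal_gaussianize(xs), marginal_gaussianize(ys)))
    return points


rng = random.Random(1)
# Strongly non-Gaussian, dependent toy data: points near a noisy ring.
data = []
for _ in range(400):
    t = rng.uniform(0, 2 * math.pi)
    r = 2.0 + 0.1 * rng.gauss(0, 1)
    data.append((r * math.cos(t), r * math.sin(t)))

result = iterate_gaussianization(data, angles=[0.3 * k for k in range(1, 21)])
first_coordinate = [p[0] for p in result]
mean = sum(first_coordinate) / len(first_coordinate)
variance = sum((v - mean) ** 2 for v in first_coordinate) / len(first_coordinate)
```

After every iteration the marginals are Gaussian by construction; what the successive linear transforms are meant to remove is the remaining dependence between coordinates.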

A description will now be given of how the iterative 
Gaussianization method of the invention can be viewed as a 
parametric generalization of high dimensional projection 
pursuit. 

A nonparametric density estimation scheme called projection pursuit density estimation is described by Friedman et al., in "Projection Pursuit Density Estimation", J. American Statistical Association, Vol. 79, pp. 599-608, September 1984. Its convergence result was subsequently proved by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. In this scheme, the density function p(x) of a random variable $X \in R^n$ is approximated by a product of ridge functions

$$p_k(x) = p_0(x) \prod_{j=1}^{k} h_j(a_j^{T}x)$$

where $p_0(x)$ is some standard probability density in $R^n$ (e.g. a normal density with the same mean and covariance as p(x)). At each iteration (k+1), the update $h_{k+1}$ can be obtained by

$$h_{k+1}(a_{k+1}^{T}x) = \frac{p(a_{k+1}^{T}x)}{p_k(a_{k+1}^{T}x)}$$

where the direction $a_{k+1}$ is constrained to have unit length $\|a_{k+1}\|_2 = 1$ and can be computed by

$$a_{k+1} = \arg\max_{a} D\big(p(a^{T}x)\,\|\,p_k(a^{T}x)\big) \qquad (10)$$

It has been suggested to estimate the marginal densities $p(a^{T}x)$ of the data by histograms or kernel estimates with the observed samples and to estimate the marginal density $p_k(a^{T}x)$ of the current model by histograms or kernel estimates with Monte Carlo samples, then to optimize (as in equation 10) by gradient descent. These suggestions are made by: Friedman et al., in "Projection Pursuit Density Estimation", J. American Statistical Association, Vol. 79, pp. 599-608, September 1984; and P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. In our view, this scheme attempts to approximate the multivariate density by a series of univariate densities; the scheme finds structures in the domain of density function. However, since there are no transforms involved, the scheme does not explicitly find structures in the data domain. In addition, the kernel estimation, the Monte Carlo sampling, and the gradient descent can be quite cumbersome.
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Data is explicitly transformed by the exploratory projec- where the operator marginally Gaussianizes the first 1 
tion pursuit described by J. H. Friedman, in "Exploratory coordinates and leaves the remaining coordinates fixed; A is 
Projection Pursuit", J. American Statistical Association, Vol obtained by minimizing the negentropy of X^**^^: 
82, pp. 249-66, March 1987. Let X^°^ be the original random 

variable. At each iteration, the most non-Gaus^an projection 5 mnJUt^A^^)) U2) 

of the current variable X^*^ is found: ^ 

at = argnunD{a^ x^^\Oc) In practice, we estimate the ideal operator by mixtures 

" of univariate Gaussians as described abofve. For XcR", we 

10 define 

where G^ is the univariate Gaussian random variable with mirywur i 

the same mean and covariance as a^X<*>. Let U be an ^CXWt (xj . . . W 

orthogonal matrix with a^^ being its first column where 

One then transforms X^^^ into the U domain: 

20 

The first coordinate of Y is Gaussianized, and the ^/(pc^y-xJiM^d^n 

remaining coordinates are left unchanged: ^ ^ equivalent to assuming that X has the following 

^^.vCvO distribution: 
i^J^ . . . , rt 
One obtains X^**^> as 



25 







MnZ'*^^-'^-^'t.ol><fn 0,1)1 



^ Therefore, the EM method of the invention described 
above can be easily accommodated to model by setting 
If we denote ^3 as the operator which Gaussianizes the number of univariate Gaussians in dimensions 1+1 

first coordinate and we leave the remaimr^ coordinates through n to be one: 
unchanged, then we have 



J^*^X-l/W(U'^>) 35 



A description of a constrained version of the iterative 

TTie two dimensional exploratory projection pursuit was Gaussianization method of the invention will now be given, 

also described by I H. Fricdnaan, in "Exploratory Projection ^j^t whiten the data and then consider only orthogonal 

Pursuit", J. American Statistical Association, \fol. 82» pp. projections for two dimensional projection pursuit has been 

249-66, March 1987. According to the two dimensional 40 described by J. H. Friedman, in "Exploratory Projection 

exploratory projection pursuit, at each iteration, one locates Pursuir, J. American Statistical Association, 82, pp. 

the most jointly non-Gaussian two dimensional projection of 249-66, March 1987. Here, the same assumption is made; 

the data and then jointly Gaussaanizes that two dimensional constrain the linear transform A to be a whitening 

plane. To jointly Gausaanize a two dimensional variable, transform W foUowed by an orthogonal transform U 

rotated (about the origin) projections of the two dimensional 45 

plane are repeatedly Gaussianized until it becomes like a-^jw 

normal More specifically, letY«(y„y-*)^be the two dimen- . „, . . , , wj-^ 

sional pUne which is most non-Ga^. "^^^'^ whitemng matrix W whitens the data X(*> 

50 

y cos Y+y2 »m r example, W can be taken as 

be a rotation about the origin through angle y. y'l and y'2 can 
then be Gaussianized individually. This process is repeated partition U as 

(on the previously rotated variables) for several values of 
Y=(0,ny8,j^/433ty8, . . . ). This entire process is then repeated 
until the distributions stop becoming more normal. 

The iterative Gaussianization method of the invention can 
be ca^y modified to achieve high dimensional e^^^^^^ j ^e^ ^ ^^^^ 

projection pursuit with efficKnt parametric optunmton. V*>.WX<*>. Applying the fiist, thirf, and second 

As^ime that we arc mterestcd m 1-dimensional projections ^ssum tions above we Iwve^ 
where l=l^n. The modification will now be described. At ^ ' 

each iteration, instead of marginally Gaussianizing all the j{^i{^^yff^ij[:fii>'^{jjy^^^ 
coordinates of AX^'^ we marginally Gaussianize the first 1 
coordinates only: and 

j!f<^^W^>) (11) j(y<*>)-^£/y«*>)>^^ui<*)>+/(£n'(*)). 
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Takkg the difference, we have 

Therefore, the modified iterative Gaussianization method (11) with the constraint

$$A = UW$$

is equivalent to finding the l-dimensional marginally most non-Gaussian orthogonal projection of the whitened data. Let $\hat{U}_1$ be the l-dimensional marginally most non-Gaussian orthogonal projection of the whitened data:

$$\hat{U}_1 = \arg\max_{U_1} J_M(U_1 W X^{(k)}).$$

Let $\hat{U}$ be any orthogonal completion of $\hat{U}_1$:

$$\hat{U} = \begin{pmatrix} \hat{U}_1 \\ \hat{U}_2 \end{pmatrix}.$$

Then $\hat{U}$ is a solution of the constrained modified iterative Gaussianization procedure



msi Bi 
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where consists of the last (n-1) rows of A, we can also 
argue that this unconstrained procedure attempts to find the most independent view in which the l most marginally non-Gaussian directions can be chosen.

A description of hierarchical independence pursuit according to an embodiment of the invention will now be given. In the construction of probabilistic models for high dimensional random variables, it is crucial to find the independence or conditional independence structures present in the data. For example, if we know that the random variable $X \in R^n$ can be separated into two variables $X = (X_1, X_2)$ such that $X_1 \in R^{n_1}$ is independent of $X_2 \in R^{n_2}$, we can then construct probabilistic models separately for $X_1$ and $X_2$, which can be much simpler than the original problem.

We attempt to transform the original variable to reveal independence structures. The iterative Gaussianization method of the invention can be viewed as independence pursuit, since when it converges, we have reached a standard normal variable, which is obviously independent dimension-wise. Moreover, the iterative Gaussianization method of the invention can be viewed as finding independent structures in a hierarchical fashion.

We first analyze the ability of finding independent structures at each Gaussianization iteration. We have shown above that each step of Gaussianization minimizes the mutual information



The EM method of the invention must be modified to accommodate the orthogonality constraint on the linear transform. Since the original EM method of the invention estimates each row of the matrix, it cannot be directly applied here. We propose using either the well-known Householder parameterization or Givens parameterization for the orthogonal matrix and modifying the M-step accordingly.
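By way of a non-limiting illustration, a Givens parameterization of an orthogonal matrix may be sketched as follows: one rotation angle per coordinate pair, composed in a fixed order, so that the orthogonality constraint holds for any choice of angles. The function names and the ordering of the pairs are hypothetical choices, not taken from the method of the invention, and this particular composition yields rotations (determinant +1) only.

```python
import math


def givens_rotation(n, i, j, angle):
    """n x n identity with a plane rotation by `angle` in coordinates (i, j)."""
    G = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    G[i][i] = math.cos(angle)
    G[j][j] = math.cos(angle)
    G[i][j] = math.sin(angle)
    G[j][i] = -math.sin(angle)
    return G


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]


def orthogonal_from_angles(n, angles):
    """Compose one Givens rotation per coordinate pair (i < j), in a fixed order."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    U = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for (i, j), angle in zip(pairs, angles):
        U = matmul(givens_rotation(n, i, j, angle), U)
    return U


# n (n - 1) / 2 = 3 free angles parameterize a 3 x 3 rotation.
U = orthogonal_from_angles(3, [0.4, -1.1, 2.0])
UUt = matmul(U, [list(col) for col in zip(*U)])
```

With such a parameterization the M-step can search over unconstrained angles instead of over matrix entries subject to an orthogonality constraint.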

The weak convergence result of this constrained modified iterative Gaussianization can be easily established by deriving the key inequality

and following the same proof described above with respect to convergence. Obviously, for larger l the convergence is faster, since $\Delta^{(k)}$, the improvement at iteration k, is larger.

Friedman's high dimensional exploratory projection algorithm attempts to find the most jointly non-Gaussian l-dimensional projection. The computation of the projection index involves high dimensional density estimation. In contrast, the Gaussianization method of the invention tries to find the most marginally non-Gaussian l-dimensional projection; it involves only univariate density estimation.

The bottleneck of l-dimensional exploratory projection pursuit is to Gaussianize the most jointly non-Gaussian l-dimensional projection into standard Gaussian, as described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, pp. 249-66, March 1987. In contrast, the Gaussianization method of the invention involves only univariate Gaussianization and can be computed by an efficient EM algorithm.

A description of an unconstrained version of the iterative Gaussianization method of the invention will now be given. If we do not employ the orthogonality constraint and we solve the more general problem (as per equation 12), we would get faster convergence, and the original EM method of the invention can be used without any modification. At each iteration, since



$$\min_A I(AX^{(k)}) \qquad (13)$$

If there exists a linear transform such that the transformed variable can be separated into independent sub variables, then our minimization of the mutual information will find that independent structure. This is precisely the problem of Multidimensional Independent Component Analysis proposed by J.-F. Cardoso, in "Multidimensional Independent Component Analysis", Proceedings of ICASSP '98, Seattle, Wash., May 1998. It is a generalization of the independent component analysis, where the components are independent of one another but not necessarily one dimensional. We call a random variable $X \in R^n$ minimal if any linear transform on X does not reduce the mutual information

and all the dimensions of the transformed variable Y are 
dependent. From the description provided by Cardoso in the 
immediately preceding article, we have the following result 
that minimizing the mutual information (13) finds the under- 
lying minimal structure. 

Let $Y = (Y_1, \ldots, Y_K)^T$ where each $Y_k$ is minimal. Let the observed variable X be a certain linear transform of Y: X = BY. Then, the solution of minimizing the mutual information is

$$B^{-1} = \arg\min_A I(AX),$$



The iterative Gaussianization method of the invention induces independent structures which can be arranged hierarchically as a tree. Assume that at the k-th iteration, $X^{(k)}$ has multidimensional independent components $X^{(k)} = (X_1^{(k)}, \ldots, X_m^{(k)})^T$, where each multidimensional component $X_j^{(k)}$ is in $R^{n_j}$ and $\sum_{j=1}^{m} n_j = n$. It can be easily shown that the next iteration of Gaussianization is equivalent to running one iteration of Gaussianization on each individual multidimensional component $X_j^{(k)}$.
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. If X has multidimensional independent components above can be viewed as parametric density estimation. We 
X«=(X;i, . . . , X^, then now describe two forms of density estimates based on 

Ganssianization according to the invention: 
ar^nanJCHAX)) <=> argmhJ(f{A jX;))v i^jsm, (1) Iterative Models which can be directly obtained from 

* S the iterative Gaussianization method of the invention 

described above. 

Therefore, the iterative Gaussianization method of the (2) Mixture Models in which each cluster corresponds to 
invention can be viewed as follows. We run iterations of one iteration of Gaussianization. 

Gaussianization on the observed variable until we c^tain A description of iterative Gaussianization densities 
multidimensional independent components; we then run according to the invention wiU now be given. Consider the 
iterations of Gaussianzation on each multidimensional inde- sequence of optimal parametric Gaussianization transforma- 
pendent component; and we repeat the process. This induces ^ions T^ (as described above) acting ^ a random variable 
a tree which describes the independence structures among X.X<°>. Since the sequence X^^-T^X^*"^) converges to a 
the observed random variable XeR". The tree has n leaves. standard normal distribution, for sufficientiy large k, X^ is 
Each leaf 1 is associated with its parents and grandparents approximately Gaussian. Since the transformations Tg^ are 
and so on: invertible, a standard normal density estimate for X^^ 

implies a density estimate for X. Indeed, if we define for 
-^i^it^itT * • '^nM each k 

where nf denotes the parent node of n and is the 20 y(*x,r -^r r (14) 

observed variable X. In this tree, all the children of a oi - % 

particular node are obtained by running iterations of Gaus- where N-N (0, 1), then the density of Y<*^ can be corisidered 

sianization until we obtain multidimensional independent a parametric estimate of the density of X for each k. The 

components; the multidimensional independent components larger k, the better the estimate. If (y**^ denotes the 

are the children. 25 Jacobian determinant of T©; evaluated atV*\ i e.. 

In practice, we need to detect the existence of multidi- 
mensional independent components. Accordingly, we pro- 

vide two relevant methods. The first method is based on the ^ (y'*') ~ 
fact that if X«(X;i, . . . > x^ has multidimensional independent 

components X-pC^, . . . , XJ, then the matrix A, which we ^ 

obtain by rurming one iteration of Gaussianization on X then, the probability distribution of Y^*^ is given by 

35 



will be block diagonal More specifically, we have ^ 
Xf is independent of x^. Therefore, we define the distance 

between dimension i and j to be the absohite vahie of A,y Essentially, for each k, Y<^ comes &om a parametric family 

dtf[Atji #of densities^ for example -j^ defined by Equation 15., 

^ The iterative Gaussianization method of the invention can 

and mn the bottom-iip hierarchical chistering scheme with ^ viewed as an ^roximation process where the probabU- 

maximum linkage. We can obtain an estimate of the multi- distribution of X, e.g., p(x), is approximated in a product 

dimensional independent components by applying thresh- (Equation 15). In particular, if p(x) belongs to one of 

olding on the chistering tree with a threshold level €>0. famiHes, e.g., -j^ then it has an exact represenUtion. 

The second method is also based on bottom-tq) hierarchi- 45 Adescription of mixtures of Gaussianizations will now be 

cal clustering with maximurn linkage; however, the distance jj^g f^^y will be called liiKarly transformed 

between dimension i and j is defined to reflect the depeo- Compound Gaussians, and is cbsely related to Independent 

deocy between X, and x,, sudi as the Hoeefiding statistics. Component Analysis, -^(e) with the restriction that the 

the Blum^fcr-Roscnblatt statistics, or the Kendall's rank transform A-I will be called Cbmpound Gaussians. 

correlation coefficient, as described by H. Wolfe, Nonpars- so Qeariy X-{xa, . . . , x^} is a conq>ound Gaussian if the 

metric Statistical Methods, Wiley, 1973. These statistics are dimensions x^s are independent and each dimension x^ is a 

designed to perform a nonparametric test of the indepen- Gaussian mixture 
dence of two random variables. The threshold level € can be 

chosen according to the significant level of the correspond- , 

ing independence test statistics. 55 ^^^^ = V ^^^oix,, t^.)) 

The above scheme is also ^licable for the purpose of m 
speeding up the iterative Gaussianization method of the 

invention. Once we achieve multidimeDsional independent , . ^ , ^ . .... 

components, we can now mn iterations of Gaussianization ^'^'^ «f * compound Gaussian can be written as: 

on the individual multidimensional independent components 6Q 

instead of the entire dimensions, since Gaussianizing a rr v ^ \ 

lower dimensional variable is computatk)naUy advanta- ^d) = J ] 2j '^w>^^» W^o^ ^f^)) 



geous. 

A description of density estimation via Gaussianization 
according to the invention will now be given. As stated 65 Compound Gaussians are closely related to Gaussian 
above, Gaussianization can be viewed as density estimation. mixtures. Hie definition (equation 16) can be expanded as a 
In particular, our parameterized Gaussianization discussed sum of product terms: 
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Compound Gaussians diagonal covariance Gaussian mix- independent 

hires with I mixture components: We convert u„ into binary variables {u^i, . . . , u„^}: 



10 



Similarly wc define 

Note that a mixture of W^^/mtO 



15 nierefore, 



diagonal Gaussians has Fl I^^H ' | 



(2D + l)q/rf-l 



We obtain the following complete log likelihood 



25 



free parameters. However, in the equivalent definition of the 
compound Gaussians (as per equation iTy, the priors, means L(x; u. z« 9) = ^ y iva 
and variances are constrained in a particular fashion such 
that the number of effective free parameters is merely 



30 

D 



pe compound Gaussian dfetribution is specificaUy 35 "^^i^^tl^^^f^?^^^ 

designed for midtivanate vanables which are independent t-s>«>p, wuiupu«. uk. ouii^naij xuuwu^u 

dimenidon-wise while non-Ganssiaa in each dimension. In 

such cases, modeling as mixtures of diagonal Gaussians ^ ^ | 

would be extremely inefiSdent Q{9. ^ = «. 9 = ^ \ jw^iog^ + 

However, compound Gaussians are oot able to describe ^ —1 ^-i [ 

the dependencies among the dimensions, whereas mixtures 

of diagonal Gaussians can. To accommodate for the ^ 

dependencies, one may consider mixtures of Compound y w^^- logfr*^^ - -lo^ixdi^^ - 

Gaussians, which is a generalization of mixtures of diagonal ^ L ^ 2dJ^j J 

Gaussians. 

Gaussianization leads to a collection of families -^^9 of 
probability densities These densities can be further gener- where 
alized by considering mixtures over -jt- We have already 

seen that compound Gaussians (c-^Q) are not able to o 
describe dependencies between dimensions^ while mixtures ^ Fl Zj '^^^('W^t/tf* ''IaP 

of Compound Gaussians can, illustrating that this generali- 50 ^ _ E^j^y^^^^ - _ 



zation can sometimes be useful v-i o 

Adescrq)tionofanEMmethodfortrainingthe mixture of Aj^'JIi ""^^^^^^^'^^^^^^y^ 

compound Gaussians according to an embodiment of the 

invention will now be given. Let {x^€R^:l=n^N} be the vt \ ^ 

training set; assume they arc indepcndcnQy and identically ^5 *w* = ^"m^m;!^' ) " 
distributed as a mixture of compound Gaussians: ^ 

ATt^n 'i^jccw ' i^^i^ 
^ ^ w 1=1 

/(X) = Y />*n T ffi^s c{xa, ai^j) i ; 

Let (x„eR^, u„eN, z^e^P:n=l, . . . , N) be the complete 

data, where u_ indicates the index of the compound , , ^ , ... 1 ■ n ^- ^ . 

Gaussian, and z^=(z„ , • . . „) are the indexes of the 1° the M.step, we maxmiize the auxiliary function Q to 

particular Gaussian along aU the dimensions. It is clear that update the parameter estimation 

ii„-Multiaomial(l, (Pi. . . . , pi^). O-ug msx^Bfi) 
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where 



28 



1^=1 »pi 



10 >V^=^'^/«M^b''«»^ = 



i'-l 1^1 



i'-l * 



Z Z ^^4y 



15 



* I 



Z Jtv/ffy G{y^, tikf^M . oi'^ / ) 
•-1 



In the M-step, we maximize the auxiliary function Q to update the parameter estimation

$$\theta = \arg\max_{\theta} Q(\theta, \hat{\theta})$$
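By way of a non-limiting illustration, the E-step and M-step may be sketched for the simplest building block, a single univariate Gaussian mixture (one coordinate of one compound Gaussian). The sketch uses the textbook closed-form updates for priors, means and variances; the full method of the invention additionally carries compound Gaussian priors and a linear transform, which are omitted here. All names and the toy data are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def log_likelihood(xs, priors, means, sigmas):
    """Log likelihood of the samples under the mixture."""
    return sum(math.log(sum(p * gaussian_pdf(x, m, s)
                            for p, m, s in zip(priors, means, sigmas))) for x in xs)


def em_step(xs, priors, means, sigmas):
    """One EM iteration for a univariate Gaussian mixture."""
    # E-step: posterior probability of each component for each sample.
    posts = []
    for x in xs:
        joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, sigmas)]
        total = sum(joint)
        posts.append([j / total for j in joint])
    # M-step: closed-form maximizers of the auxiliary function Q.
    new_priors, new_means, new_sigmas = [], [], []
    for i in range(len(priors)):
        weight = sum(post[i] for post in posts)
        mean = sum(post[i] * x for post, x in zip(posts, xs)) / weight
        var = sum(post[i] * (x - mean) ** 2 for post, x in zip(posts, xs)) / weight
        new_priors.append(weight / len(xs))
        new_means.append(mean)
        new_sigmas.append(math.sqrt(var))
    return new_priors, new_means, new_sigmas


rng = random.Random(2)
xs = ([rng.gauss(-2.0, 0.5) for _ in range(150)]
      + [rng.gauss(1.5, 1.0) for _ in range(250)])
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [log_likelihood(xs, *params)]
for _ in range(25):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

Because each M-step maximizes Q for the posteriors computed in the preceding E-step, the recorded log likelihood does not decrease from one iteration to the next.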



Adescription.wiUnowbcgiveDof mixturcsofcompound We have the foUowing iterative scheme. Note that the 
Gaussians with linear transforms (i.e. -.-Mixtures). Let transform matrix A can be mitiahzcd as the current cstunatc 

' or as an identity matrix. 

(1) Update the compound Gaussian model parameters: 



x^^jcR^ be the random variable we are interested in. 
Assume that after a linear transform A, the transformed 
variable y«Ax can be modeled as a mixture of compound 



Gaussians: 



fiy) 



D ^4 
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We describe here an EM method according to the invention 
to estimate both the transform matrix A and the compound Gaussian model parameters. Let $\{x_n \in R^D : 1 \le n \le N\}$ be the training set. Let $\{y_n = Ax_n : 1 \le n \le N\}$ be the transformed data. Let $\{(y_n \in R^D, x_n \in R^D, u_n \in \mathbb{N}, z_n \in \mathbb{N}^D) : n = 1, \ldots, N\}$ be the complete data, where $u_n$ indicates the index of the compound Gaussian, and $z_n = (z_{n,1}, \ldots, z_{n,D})$ are the indexes of the particular Gaussian along all the dimensions. We obtain the following complete log likelihood



N K 

1^1 k=i 



where θ = (A, π, μ, σ).

In the E-step, we compute the auxiliary function



45 



50 
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Z Z Hwjf 

*'=! i^l 



Z Z >»5a^ 

wa| 

Z Z ^nA^ 
JV 

Z Z WiUAi' 



where the transformed data are obtained using the 
current estimate of A: y„»A^. 
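As a rough illustration of step (1), the sketch below performs one EM pass over the transformed data with the transform A held fixed. It uses the textbook EM updates for a mixture whose components factor into per-coordinate univariate mixtures; it is an assumption that these match the update equations of the method, which are not legible here. The names `em_step`, `var_floor`, and the `[k][d][i]` list layout are hypothetical, and the variance floor is an added safeguard rather than part of the method.

```python
import math

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density G(y, mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(Y, rho, pi, mu, var, var_floor=1e-6):
    """One EM pass with the transform held fixed.

    Y holds the transformed vectors y_n; pi, mu, var are indexed [k][d][i].
    Returns updated (rho, pi, mu, var) and the log likelihood of Y under the
    parameters passed in (the log|det A| term is constant here and omitted).
    """
    N, K, D = len(Y), len(rho), len(Y[0])

    def zeros():
        return [[[0.0] * len(pi[k][d]) for d in range(D)] for k in range(K)]

    omega_sum = [0.0] * K
    s0, s1, s2 = zeros(), zeros(), zeros()  # sums of nu, nu*y, nu*y^2
    loglik = 0.0
    for y in Y:
        # comp[k][d][i] = pi * G(y_d, mu, var); marg[k][d] sums it over i.
        comp = [[[p * gaussian_pdf(y[d], m, v)
                  for p, m, v in zip(pi[k][d], mu[k][d], var[k][d])]
                 for d in range(D)] for k in range(K)]
        marg = [[sum(c) for c in comp[k]] for k in range(K)]
        joint = [rho[k] * math.prod(marg[k]) for k in range(K)]
        total = sum(joint)
        loglik += math.log(total)
        for k in range(K):
            omega = joint[k] / total  # posterior of compound Gaussian k
            omega_sum[k] += omega
            for d in range(D):
                if marg[k][d] == 0.0:
                    continue  # numerical underflow: no contribution
                for i, c in enumerate(comp[k][d]):
                    nu = omega * c / marg[k][d]  # posterior of (k, d, i)
                    s0[k][d][i] += nu
                    s1[k][d][i] += nu * y[d]
                    s2[k][d][i] += nu * y[d] ** 2
    new_rho = [w / N for w in omega_sum]
    new_pi, new_mu, new_var = zeros(), zeros(), zeros()
    for k in range(K):
        for d in range(D):
            for i, n in enumerate(s0[k][d]):
                mean = s1[k][d][i] / n
                new_pi[k][d][i] = n / omega_sum[k]
                new_mu[k][d][i] = mean
                new_var[k][d][i] = max(s2[k][d][i] / n - mean ** 2, var_floor)
    return new_rho, new_pi, new_mu, new_var, loglik
```

Calling `em_step` repeatedly and feeding each result back in gives a non-decreasing sequence of log likelihoods as long as the variance floor is not hit. The sketch does not guard against a component that receives no posterior mass; a production implementation would need to.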
(2) Update the transform matrix A. We will update A row
by row. Let the vector $a_j$ be the j-th row of A, and let
$c_j = (c_1, \dots, c_D)$ be the cofactors of A associated with
the j-th row. Note that $a_j$ and $c_j$ are both row vectors. We
plug the updated $(\pi, \mu, \sigma)$ into $Q(\theta, \hat{\theta})$:

[Equation illegible in the source scan.]

where

[Equations illegible in the source scan.]

Therefore, the j-th row $a_j$ can be updated as

[Equation illegible in the source scan.]

where [...] can be solved from the quadratic equation (6).

[Equation illegible in the source scan.]

A description of feature extraction via Gaussianization
will now be given. In the problem of classification, we are
often interested in extracting a lower dimensional feature
from the original feature such that the new feature has the
lowest dimension possible, but retains as much discriminative
information among the classes as possible. If we have
lower dimensional features, we often need less training data
and we can train more robust classifiers.

We now consider feature extraction with the naive Bayesian
classifier. Let $x \in R^D$ be the original feature, and let
$c \in \{1, \dots, L\}$ be the class label. By the Bayes formula

$$p(c|x) = \frac{p(x|c)\,p(c)}{\sum_{c'=1}^{L} p(x|c')\,p(c')}$$

and we classify x as

$$\hat{c} = \arg\max_{c} p(x|c)\,p(c)$$

where $p(x|c)$ is the probability density of the feature vector
and $p(c)$ is the prior probability for class c. Typically, $p(x|c)$
can be modeled as a parametric density

[Equation illegible in the source scan.]

where the parameters $\theta_c$ can be estimated from the training
data. We seek a transform T such that the naive Bayesian
classifier in the transformed feature space $y = T(x)$ outperforms
the naive Bayesian classifier in the original feature
space. Let $\{x_i, c_i : 1 \le i \le N\}$ be the training data.

Prior work has focused on linear transforms. In the
well-known linear discriminant analysis (LDA), we assume
that the class densities are Gaussian with a common covariance
matrix

[Equation illegible in the source scan.]

We estimate the transform matrix T via maximum likelihood

[Equation (18) illegible in the source scan.]

Recently, there have been generalizations of LDA to
heteroscedastic cases, e.g., the Heteroscedastic LDA and the
semi-tied covariance. The Heteroscedastic LDA is described
by N. Kumar, in "Investigation of Silicon-Auditory Models
and Generalization of Linear Discriminant Analysis for
Improved Speech Recognition", Ph.D. dissertation, Johns
Hopkins University, Baltimore, Md., 1997. The semi-tied
covariance is described by: R. A. Gopinath, in "Constrained
Maximum Likelihood Modeling with Gaussian
Distributions", Proc. of DARPA Speech Recognition
Workshop, February 8-11, Lansdowne, Va., 1998; and M. J.
F. Gales, in "Semi-tied Covariance Matrices for Hidden
Markov Models", IEEE Transactions Speech and Audio
Processing, Vol. 7, pp. 272-81, 1999. These techniques
assume that each class has its own covariance matrix, albeit
diagonal. Both the linear transform and the Gaussian model
parameters can be estimated via maximum likelihood as in
(18).

We consider here a particular nonlinear transform and we
optimize the nonlinear transform via maximum likelihood.

Motivation for our nonlinear transform will now be
described. Linear techniques such as Heteroscedastic LDA
and the semi-tied covariance assume diagonal covariance
and attempt to find a linear transform such that the diagonal
(i.e., independent) assumption is more valid. This is
described by R. A. Gopinath, in "Constrained Maximum
Likelihood Modeling with Gaussian Distributions", Proc. of
DARPA Speech Recognition Workshop, February 8-11,
Lansdowne, Va., 1998. We find a transform such that the
Gaussian assumption is more valid, which leads us to
Gaussianization.

We [...] to Gaussianize each dimension [...] can be followed
by, e.g., the semi-tied covariance technique,
to find the best linear projection. Theoretically, we can write
out the log likelihood with respect to both the Gaussianization
parameters and the linear projection parameters.
However, it is extremely difficult to optimize them for a
large dataset, such as in automatic speech recognition.

Without loss of generality, we assume $x \in R^1$. It is important
to choose the proper parameterization of Gaussianization
such that the maximum likelihood optimization can be
carried out efficiently. We parameterize Gaussianization as a
piecewise linear transform (the Gaussianization transform). We
[...] the knots $t_0 \le t_1 \le \dots$ such that $t_0 \le x \le$ [...].

[Equation (19) illegible in the source scan.]

where $h_m$ are the first-order splines

[Equation illegible in the source scan.]

Notice that, on each interval $t_i \le x \le t_{i+1}$,

[Equation illegible in the source scan.]

We assume a univariate Gaussian distribution for the
transformed variable y

[Equation illegible in the source scan.]

The log likelihood function is then written out and, after
several intermediate quantities are denoted, rewritten in
terms of those quantities.

[Equations illegible in the source scan.]
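To make the piecewise linear parameterization concrete, the sketch below builds a transform from first-order splines on a set of knots and evaluates the log density of x under a univariate Gaussian model for the transformed variable. It assumes that the first-order splines are the usual "hat" functions (one at their own knot, zero at the others) and that the transform is their weighted sum with coefficients `a`; the exact form of equation (19) is not legible here, so this is one plausible reading, not a reproduction. All function names are illustrative. The Jacobian term follows from the standard change-of-variables rule and requires a strictly increasing transform.

```python
import math

def hat(m, x, knots):
    """First-order spline centred on knots[m]: 1 at knots[m], 0 at other knots."""
    if m > 0 and knots[m - 1] <= x <= knots[m]:
        return (x - knots[m - 1]) / (knots[m] - knots[m - 1])
    if m < len(knots) - 1 and knots[m] <= x <= knots[m + 1]:
        return (knots[m + 1] - x) / (knots[m + 1] - knots[m])
    return 0.0

def transform(x, knots, a):
    """Piecewise linear transform y = sum_m a[m] * hat(m, x, knots)."""
    return sum(a_m * hat(m, x, knots) for m, a_m in enumerate(a))

def slope(x, knots, a):
    """Derivative of the transform on the interval containing x."""
    for i in range(len(knots) - 1):
        if knots[i] <= x <= knots[i + 1]:
            return (a[i + 1] - a[i]) / (knots[i + 1] - knots[i])
    raise ValueError("x lies outside the knot range")

def is_strictly_increasing(a):
    """The transform is strictly increasing iff the coefficients are."""
    return all(lo < hi for lo, hi in zip(a, a[1:]))

def log_density(x, knots, a, mean, var):
    """log p(x) = log G(T(x), mean, var) + log T'(x), by change of variables."""
    y = transform(x, knots, a)
    log_gauss = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mean) ** 2 / var
    return log_gauss + math.log(slope(x, knots, a))
```

Summing `log_density` over training samples gives an objective that a numerical optimizer could maximize over `a`, `mean`, and `var`, subject to `is_strictly_increasing(a)`.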
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A description of various implementations of the methods 
of the present invention will now be given, along with test 
results of the iterative Gaussianization method of the inven- 
tion. 

It is to be understood that the present invention may be 
implemented in various forms of hardware, software, 
firmware, special purpose processors, or a combination 
thereof. Preferably, the present invention is implemented in 
software as an application program tangibly embodied on a 
program storage device. The application program may be
uploaded to, and executed by, a machine comprising any
suitable architecture. Preferably, the machine is imple-
mented on a computer platform having hardware such as one
or more central processing units (CPU), a random access
memory (RAM), and input/output (I/O) interface(s). The
computer platform also includes an operating system and 
micro instruction code. The various processes and functions 
described herein may either be part of the micro instruction 
code or part of the application program (or a combination 
thereof) which is executed via the operating system. In 
addition, various other peripheral devices may be connected 
to the computer platform such as an additional data storage 
device and a printing device. 

It is to be further understood that, because some of the
constituent system components and method steps depicted
in the accompanying Figures are preferably implemented in
software, the actual connections between the system com-
ponents (or the process steps) may differ depending upon the
manner in which the present invention is programmed.
Given the teachings of the present invention provided 
herein, one of ordinary skill in the related art will be able to 
contemplate these and similar implementations or configu- 
rations of the present invention. 

FIG. 6 is a block diagram of a computer processing
system 600 to which the present invention may be applied
according to an embodiment thereof. The computer process-
ing system includes at least one processor (CPU) 602
operatively coupled to other components via a system bus
604. A read-only memory (ROM) 606, a random access
memory (RAM) 608, a display adapter 610, an I/O adapter
612, and a user interface adapter 614 are operatively coupled
to the system bus 604 by the I/O adapter 612.

A mouse 620 and keyboard 622 are operatively coupled to 
the system bus 604 by the user interface adapter 614. The
mouse 620 and keyboard 622 may be used to input/output
information to/from the computer processing system 600. It
is to be appreciated that other configurations of computer
processing system 600 may be employed in accordance with 
the present invention while maintaining the spirit and the 
scope thereof. 

The iterative Gaussianization method of the invention, 
both as parametric projection pursuit and as hierarchical 
independent pursuit, can be effective in high dimensional 
structure mining and high dimensional data visualization.
Also, most of the methods of the invention can be directly
applied to automatic speech and speaker recognition. For
example, the standard Gaussian mixture density models can 
be replaced by mixtures of compound Gaussians (with or 
without linear transforms). 

We constrain

[Constraint illegible in the source scan.]

i.e., we assume that the Gaussianization transform (as per
equation 19) is strictly increasing. We can solve this using a
standard numerical optimization package.

TABLE 1

                   [...]     1-d Gaussianization
Word Error Rate    18.5%     18.1%



The nonlinear feature extraction method of the present
invention can be applied to the front end of a speech
recognition system. Standard mel-cepstral computation
involves taking the logarithm of the Mel-binned Fourier
spectrum. According to an embodiment of the invention, the
logarithm is replaced by univariate Gaussianization along
each dimension. The univariate Gaussianization is estimated
by pooling the entire training data for all the classes. Table
1 illustrates the results of using this technique on the 1997
DARPA HUB4 Broadcast News Transcription Evaluation
data using an HMM system with 135K Gaussians. As
shown, a modest improvement (0.4% absolute, from 18.5% to
18.1% word error rate) is obtained. One would expect that
the nonlinear feature extraction algorithm provided above,
which is optimized for multiple populations, would perform
better.
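The idea of replacing the logarithm by a per-dimension univariate Gaussianization estimated from pooled training data can be sketched as follows. Note that this sketch swaps in a simple nonparametric estimator (the pooled empirical distribution followed by the inverse standard normal CDF) in place of the parametric Gaussianization of the invention, purely to show where the transform sits in the front end. The function names and the representation of frames as lists of Mel-binned spectral values are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist

def fit_gaussianizer(values):
    """Build a scalar map to standard-normal scores from pooled samples."""
    pooled = sorted(values)
    n = len(pooled)
    std_normal = NormalDist()

    def gaussianize(x):
        rank = bisect_right(pooled, x)   # number of pooled samples <= x
        u = (rank + 0.5) / (n + 1.0)     # keeps u strictly inside (0, 1)
        return std_normal.inv_cdf(u)

    return gaussianize

def gaussianize_features(frames):
    """Apply a separately fitted univariate map to every dimension.

    frames is a list of equal-length vectors (e.g. Mel-binned spectra);
    the per-dimension maps are fitted on all frames pooled together.
    """
    dims = range(len(frames[0]))
    maps = [fit_gaussianizer([frame[d] for frame in frames]) for d in dims]
    return [[maps[d](frame[d]) for d in dims] for frame in frames]
```

Each output dimension preserves the ordering of its input values, so the map plays the same compressive, monotonic role as the logarithm it replaces. Values outside the range seen in training are clamped to the scores of the extreme training samples.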

Although the illustrative embodiments have been 
described herein with reference to the accompanying 
drawings, it is to be understood that the present invention is 
not limited to those precise embodiments, and that various 
other changes may be effected therein by one of ordinary
skill in the art without departing from the scope or spirit of 
the invention. All such changes and modifications are 
intended to be included within the scope of the invention as 
defined by the appended claims. 

What is claimed is: 

1. A method for generating a high dimensional density
model within an acoustic model for one of a speech and a 
speaker recognition system, the density model having a 
plurality of components, each component having a plurality 
of coordinates corresponding to a feature space, the method
comprising the steps of: 

transforming acoustic data obtained from at least one 

speaker into high dimensional feature vectors; 
forming the density model to model the feature vectors by
a mixture of compound Gaussians with a linear
transform, wherein each compound Gaussian is asso-
ciated with a compound Gaussian prior and models
each of the coordinates of each of the components of
the density model independently by a univariate Gaus- 
sian mixture comprising a univariate Gaussian prior, 
variance, and mean. 

2. The method according to claim 1, further comprising
the step of applying an iterative expectation maximization
(EM) method to the feature vectors, to estimate the linear
transform, the compound Gaussian priors, and the univariate 
Gaussian priors, variances, and means. 

3. The method according to claim 2, wherein the EM 
method comprises the steps of: 

computing an auxiliary function Q of the EM method; 

respectively updating the compound Gaussian priors and 
the univariate Gaussian priors, to maximize the auxil-
iary function Q;

respectively updating the univariate Gaussian variances, 
the linear transform row by row, and the univariate 
Gaussian means, to maximize the auxiliary function Q; 

repeating said second updating step, until the auxiliary 
function Q converges to a local maximum; 

repeating said computing step and said second updating 
step, until a log likelihood of the feature vectors con- 
verges to a local maximum. 

4. The method according to claim 3, further comprising 
the step of updating the density model to model the feature 
vectors by the mixture of compound Gaussians with the 
updated linear transform, wherein each of the compound
Gaussians is associated with one of the updated compound
Gaussian priors and models each of the coordinates of each
of the components independently by the univariate Gaussian
mixtures comprising the updated univariate Gaussian priors,
variances, and means.

5. The method according to claim 3, wherein the linear
transform is fixed, when the univariate Gaussian variances 
are updated. 

6. The method according to claim 3, wherein the univari- 
ate Gaussian variances are fixed, when the linear transform 
is updated. 

7. The method according to claim 3, wherein the linear
transform is fixed, when the univariate Gaussian means are 
updated. 

8. A method for generating a high dimensional density 
model within an acoustic model for one of a speech and a 
speaker recognition system, comprising the steps of:

transforming acoustic data obtained from at least one
speaker into high dimensional feature vectors; 

forming the density model to model the feature vectors by 
a mixture of compound Gaussians with a linear 
transform, wherein each compound Gaussian is asso-
ciated with a compound Gaussian prior and models
each coordinate of each component of the density
model independently by a univariate Gaussian mixture
comprising a univariate Gaussian prior, variance, and
mean; 

applying an iterative expectation maximization (EM) 
method to the feature vectors, comprising the steps of: 

computing an auxiliary function Q of the EM method;

respectively updating the compound Gaussian priors and 
the univariate Gaussian priors, to maximize the func-
tion Q;

respectively updating the univariate Gaussian variances, 
the linear transform row by row, and the univariate 
Gaussian means, to maximize the function Q;

repeating said second updating step, until the auxiliary 
function Q converges to a local maximum; 

repeating said computing step and said second updating
step, until a log likelihood of the feature vectors con- 
verges to a local maximum; and 

updating the density model to model the feature vectors 
by the mixture of compound Gaussians with the 
updated linear transform, wherein each compound
Gaussian is associated with one of the updated com- 
pound Gaussian priors and models each of the coordi- 
nates of each of the components independently by 
univariate Gaussian mixtures comprising the updated 
univariate Gaussian priors, variances, and means. 

9. The method according to claim 8, wherein the linear
transform is fixed, when the univariate Gaussian variances 
are updated. 

10. The method according to claim 8, wherein the univari- 
ate Gaussian variances are fixed, when the linear transform 
is updated. 

11. The method according to claim 8, wherein the linear 
transform is fixed, when the univariate Gaussian means are
updated. 

12. The method according to claim 8, further comprising 
the step of determining the log likelihood of the high 
dimensional acoustic data, prior to said applying step. 

13. The method according to claim 12, wherein the
auxiliary function Q is computed based upon the log like- 
lihood of the feature vectors. 

14. A program storage device readable by machine, tangibly
embodying a program of instructions executable by the
machine to perform method steps for generating a high
dimensional density model within an acoustic model for one
of a speech and a speaker recognition system, said method
steps comprising:

forming the density model to model feature vectors
obtained from at least one speaker by a mixture of
compound Gaussians with a linear transform, wherein
each compound Gaussian is associated with a compound
Gaussian prior and models each coordinate of
each component of the density model independently by
a univariate Gaussian mixture comprising a univariate
Gaussian prior, variance, and mean;

applying an iterative expectation maximization (EM)
method to the feature vectors, comprising the steps of:

computing an auxiliary function Q of the EM method;

respectively updating the compound Gaussian priors and
the univariate Gaussian priors, to maximize the function Q;

respectively updating the univariate Gaussian variances,
the linear transform row by row, and the univariate
Gaussian means, to maximize the function Q;

repeating said second updating step, until the auxiliary
function Q converges to a local maximum;

repeating said computing step and said second updating
step, until a log likelihood of the feature vectors converges
to a local maximum; and

updating the density model to model the feature vectors
by the mixture of compound Gaussians with the
updated linear transform, wherein each compound
Gaussian is associated with one of the updated compound
Gaussian priors and models each of the coordinates
of each of the components independently by
univariate Gaussian mixtures comprising the updated
univariate Gaussian priors, variances, and means.

15. The program storage device according to claim 14,
wherein the linear transform is fixed, when the univariate
Gaussian variances are updated.

16. The program storage device according to claim 14,
wherein the univariate Gaussian variances are fixed, when
the linear transform is updated.

17. The program storage device according to claim 14,
wherein the linear transform is fixed, when the univariate
Gaussian means are updated.

18. The program storage device according to claim 14,
further comprising the step of determining the log likelihood
of the high dimensional acoustic data, prior to said applying
step.

19. The program storage device according to claim 18,
wherein the auxiliary function Q is computed based upon the
log likelihood of the feature vectors.

* * * * *